
Collaborating Authors

Tacotron 2


Supplementary Material of Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search Appendix A

Neural Information Processing Systems

The detailed encoder architecture is depicted in Figure 7. We design the grouped 1x1 convolutions to be able to mix channels; Figure 8c shows an example. The decoder takes a mel-spectrogram and squeezes it. Then, the decoder processes it through a number of flow blocks.
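As a rough illustration of the squeeze and grouped 1x1 convolution operations described above, the NumPy sketch below squeezes a mel-spectrogram along time and mixes channels within groups via small invertible matrices. The shapes, group size, and helper names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def squeeze(mel, factor=2):
    """Squeeze a mel-spectrogram [channels, time]: halve the time axis
    and double the channel axis (assumed squeeze convention)."""
    c, t = mel.shape
    t = (t // factor) * factor                      # drop ragged frames
    return (mel[:, :t].reshape(c, t // factor, factor)
                      .transpose(0, 2, 1)
                      .reshape(c * factor, t // factor))

def grouped_1x1_conv(x, weights):
    """Apply an invertible 1x1 convolution per channel group.
    `weights` is a list of small square matrices, one per group, so
    channels are mixed only within each group."""
    g = weights[0].shape[0]
    out = np.empty_like(x)
    for i, w in enumerate(weights):
        out[i * g:(i + 1) * g] = w @ x[i * g:(i + 1) * g]
    return out

mel = np.random.randn(80, 100)                      # 80 mel bins, 100 frames
z = squeeze(mel)                                    # shape (160, 50)
# Orthogonal matrices (from QR) are trivially invertible, as a flow requires.
ws = [np.linalg.qr(np.random.randn(4, 4))[0] for _ in range(z.shape[0] // 4)]
y = grouped_1x1_conv(z, ws)                         # channels mixed in groups of 4
```

Because each group matrix is orthogonal, the operation is exactly invertible by applying the transposes, which is what makes it usable inside a normalizing flow.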



Thanks to all the reviewers for the detailed and thoughtful comments

Neural Information Processing Systems

Thanks to all the reviewers for the detailed and thoughtful comments. There are HMM-based works [1, 2, 3], all of which proposed methods to estimate alignments from unsegmented data. We have not thoroughly explored improving the duration predictor and simply followed the same ... We design the grouped 1x1 convolutions to be able to mix channels. For example, to generate a speech of 5.8 ... Therefore, adopting parallel TTS models significantly improves the sampling speed of end-to-end systems. In Section 5.3, we showed that varying temperature can change ... We will add a reference about Viterbi training.
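The temperature control mentioned above (Section 5.3) is commonly realized in flow-based TTS by scaling the standard deviation of the prior before sampling the latent. The helper below is a hypothetical sketch of that idea, not the authors' code; the latent shape is an arbitrary assumption.

```python
import numpy as np

def sample_latent(mu, sigma, temperature=1.0, rng=None):
    """Sample z from N(mu, (temperature * sigma)^2).
    Lowering the temperature trades sample diversity for stability
    of the generated speech (hypothetical helper for illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)
    return mu + temperature * sigma * eps

mu = np.zeros((80, 50))                   # assumed latent shape
sigma = np.ones((80, 50))
z_default = sample_latent(mu, sigma, temperature=1.0)
z_stable = sample_latent(mu, sigma, temperature=0.333)
```

A lower temperature concentrates samples near the prior mean, which typically yields flatter but more robust prosody, while temperature 1.0 preserves the full learned variability.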








Text to Speech System for Meitei Mayek Script

Irengbam, Gangular Singh, Wahengbam, Nirvash Singh, Khumanthem, Lanthoiba Meitei, Oinam, Paikhomba

arXiv.org Artificial Intelligence

This paper presents the development of a Text-to-Speech (TTS) system for the Manipuri language using the Meitei Mayek script. Leveraging Tacotron 2 and HiFi-GAN, we introduce a neural TTS architecture adapted to support tonal phonology and under-resourced linguistic environments. We develop a phoneme mapping for Meitei Mayek to ARPAbet, curate a single-speaker dataset, and demonstrate intelligible and natural speech synthesis, validated through subjective and objective metrics. This system lays the groundwork for linguistic preservation and technological inclusion of Manipuri.
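The phoneme mapping described above can be pictured as a lookup table from Meitei Mayek letters to ARPAbet symbols. The sketch below uses romanized letter names and a handful of illustrative consonant pairings; the paper's actual mapping, coverage, and tone handling are not reproduced here.

```python
# Illustrative grapheme-to-phoneme table keyed by romanized Meitei Mayek
# letter names; the specific pairings are assumptions for this sketch.
MEITEI_TO_ARPABET = {
    "KOK": "K", "SAM": "S", "LAI": "L", "MIT": "M",
    "PA": "P", "NA": "N", "CHIL": "CH", "TIL": "T",
}

def to_arpabet(letters):
    """Map a sequence of Meitei Mayek letter names to ARPAbet symbols,
    passing unknown letters through unchanged for later inspection."""
    return [MEITEI_TO_ARPABET.get(letter, letter) for letter in letters]

print(to_arpabet(["KOK", "NA"]))  # -> ['K', 'N']
```

Passing unknown letters through unchanged makes gaps in the table easy to spot when curating a dataset for an under-resourced script.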


Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models

Lee, Kyowoon, Stitsyuk, Artyom, Jho, Gunu, Hwang, Inchul, Choi, Jaesik

arXiv.org Artificial Intelligence

Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.
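The core idea of intervening on internal representations at inference time can be sketched with a toy network: compute a hidden activation, apply an edit function to it, and observe how the output changes. This is a generic stand-in for the model-agnostic mechanism described above, not the paper's method or a real TTS model.

```python
import numpy as np

def forward(x, w1, w2, edit=None):
    """Tiny two-layer network. `edit` is an optional function applied to
    the hidden activation, mimicking post-hoc activation editing on a
    pre-trained model's internals (toy illustration only)."""
    h = np.tanh(w1 @ x)
    if edit is not None:
        h = edit(h)                      # intervene on the representation
    return w2 @ h

rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((2, 8))
x = rng.standard_normal(4)

y_base = forward(x, w1, w2)
# Counterfactual: amplify one hidden unit (e.g. a unit hypothetically
# correlated with a prosodic attribute) and leave the rest untouched.
y_edit = forward(x, w1, w2,
                 edit=lambda h: np.where(np.arange(8) == 3, 2.0 * h, h))
```

Because the edit is applied only at inference, the pre-trained weights stay frozen, which is what makes this kind of correction "post-hoc" rather than a retraining step.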


Reviews: FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems

Originality: Although phoneme duration prediction is widely adopted in conventional TTS systems, jointly training it in a neural TTS model is new. This paper is one of the first works on non-autoregressive text-to-spectrogram modeling. Quality: This paper seems sound overall, except for a few issues in the comments below. Some of these issues must be addressed before acceptance. Clarity: A well-written paper. Significance: The advantages over its autoregressive counterparts are significant, especially for industrial use.